Machine Learning Final Project

Authors
Affiliation

Adam Lodrik

University of Lausanne

Favre Stefan

Macaraeg Jeff

Published

May 17, 2024

Abstract

The following machine learning project focuses on…

1 Introduction

• The context and background: course, company name, business context.

During the first year of our Master in Management, orientation Business Analytics, we had the opportunity to attend the lectures of Machine Learning for Business Analytics. In this class, we covered multiple machine learning techniques for business contexts, mainly supervised methods (regressions, trees, support vector machines, neural networks) and unsupervised methods (clustering, PCA, FAMD, auto-encoders), as well as other topics such as data splitting, ensemble methods and evaluation metrics.

• Aim of the investigation: major terms should be defined, the question of research (more generally the issue), why it is of interest and relevant in that context.

In the context of this class, our group had the opportunity to work on an applied project. Starting from scratch, we had to look for a dataset on which to apply, in a real case, what we had learned in class. We found an interesting dataset covering vehicle MPG, range, engine statistics and more, for more than 100 brands. The goal of our research is to predict the make (i.e. the brand) of a car from its characteristics (consumption, range, fuel type, …) using a model that we train ourselves (e.g. random forests, neural networks or decision trees). As some cars share several identical characteristics but differ on various others, we thought it would be pertinent to build a model able to predict a car’s brand from its features.

• Description of the data and the general material provided and how it was made available (and/or collected, if it is relevant). Only in broad terms however, the data will be further described in a following section. Typically, the origin/source of the data (the company, webpage, etc.), the type of files (Excel files, etc.), and what it contains in broad terms (e.g. “a file containing weekly sales with the factors of interest including in particular the promotion characteristics”).

The csv dataset was found on data.world, a data catalog platform that gathers various open-access datasets online. The file contains more than 45,000 rows and 26 columns, each column corresponding to one feature (such as the model year, the model, the consumption in barrels, the highway MPG per fuel type and so on).

• The method that is used, in broad terms, no details needed at this point. E.g. “Model based machine learning will help us quantifying the important factors on the sales”.

Among these columns, we had to find a machine learning model that could help us quantify the importance of the features in predicting the make of the car. Various models will be tried, covering both supervised and unsupervised learning.

• An outlook: a short paragraph indicating from now what will be treated in each following sections/chapters. E.g. “in Section 3, we describe the data. Section 4 is dedicated to the presentation of the text mining methods…”

The remainder of the report is organized as follows. Section 2 describes the data in more depth, covering the variables and features, the instances, the types of data and some missing-data patterns. Section 3 presents the Exploratory Data Analysis (EDA), with visualizations that help reveal patterns in the variables as well as potential correlations. Section 4 covers the data cleaning. The following sections then present the methods, supervised and unsupervised, that we tried in order to find a suitable model for our project: a classification tree, neural networks and a principal component analysis. The results are discussed along the way, and we close with a conclusion, recommendations and a discussion. The references and the appendix can be found at the end of the report.

2 Data description

For this project, we selected a dataset focused on vehicle characteristics, available as a .csv file from data.world. It includes a total of 26 features describing 45,896 vehicle models from 141 brands released between 1984 and 2023. Below is a table providing an overview of the available features and their descriptions; a deeper description of the data can be found in the Annex. The analysis of the missing values will be dealt with in the data cleaning section.

2.0.1 Description of the features

Variable Name: Explanation
ID: Number identifying the precise combination of features of the model
Model Year: Year of the model of the car
Make: Brand of the car
Model: Model of the car
Estimated Annual Petroleum Consumption (Barrels): Annual consumption in petroleum barrels
Fuel Type 1: First fuel energy source; the only source if the car is not a hybrid
City MPG (Fuel Type 1): Consumption of the car in miles per gallon when driving in a city, for fuel type 1
Highway MPG (Fuel Type 1): Consumption of the car in miles per gallon when driving on a highway, for fuel type 1
Combined MPG (Fuel Type 1): Combined city and highway consumption in miles per gallon, for fuel type 1
Fuel Type 2: Second energy source, for hybrid cars
City MPG (Fuel Type 2): Consumption of the car in miles per gallon when driving in a city, for fuel type 2
Highway MPG (Fuel Type 2): Consumption of the car in miles per gallon when driving on a highway, for fuel type 2
Combined MPG (Fuel Type 2): Combined city and highway consumption in miles per gallon, for fuel type 2
Engine Cylinders: Number of cylinders of the engine
Engine Displacement: Cylinder volume swept by all of the pistons of a piston engine, excluding the combustion chambers
Drive: Type of powertrain distribution system, such as rear-wheel drive, 4-wheel drive, ...
Engine Description: Description of some features of the car, such as turbo engine, stop-start system, ...
Transmission: Manual/automatic transmission, with number of gears and/or model of transmission
Vehicle Class: Type of vehicle, such as minivan, truck, ...
Time to Charge EV (hours at 120v): Number of hours required to fully charge an EV at 120 V
Time to Charge EV (hours at 240v): Number of hours required to fully charge an EV at 240 V
Range (for EV): Maximum number of miles possible with a fully charged EV
City Range (for EV - Fuel Type 1): Maximum number of miles possible with a fully charged EV in a city
City Range (for EV - Fuel Type 2): Maximum number of miles possible using only electricity with a fully charged hybrid in a city
Hwy Range (for EV - Fuel Type 1): Maximum number of miles possible with a fully charged EV on a highway
Hwy Range (for EV - Fuel Type 2): Maximum number of miles possible using only electricity with a fully charged hybrid on a highway

3 EDA

Now that we have presented the variables contained in the dataset, let’s try to understand the data structure, its characteristics and underlying patterns through an EDA.

3.0.1 Dataset overview

3.0.2 Columns description

Let’s have a quick look at the characteristics of the columns. You will find more statistical details about them in the annex.

Show the code
# Load necessary packages
library(dplyr)
library(knitr)
library(kableExtra)
# Assuming your dataset is named 'data'

# Get the number of rows and columns
num_rows <- nrow(data)
num_cols <- ncol(data)

# Get the frequency of column types
col_types <- sapply(data, class)
type_freq <- table(col_types)
char_count <- ifelse("character" %in% names(type_freq), type_freq["character"], 0)
num_count <- ifelse("numeric" %in% names(type_freq), type_freq["numeric"], 0)

# Create a summary data frame
summary_df <- data.frame(
  Name = "data",
  Number_of_rows = num_rows,
  Number_of_columns = num_cols,
  Character = char_count,
  Numeric = num_count,
  Group_variables = "None"
)

# Display the summary using kable
kable(summary_df, format = "html", table.attr = "style='width:50%;'") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Name Number_of_rows Number_of_columns Character Numeric Group_variables
data 45896 26 8 18 None

The dataset that we are working with contains approximately 46,000 rows and 26 columns, each row representing a model from one of the 141 brands. From the data overview, we can see that most of our features concern the consumption of the cars. Looking in more detail in the annex, we notice that some variables contain many missing values and that the variable “Time.to.Charge.EV..hours.at.120v.” contains only zeros. We will handle these issues in the “Data cleaning” section.

3.0.3 Exploration of the distribution

Now let’s explore the distribution of the numerical features.

Show the code
# plot_histogram() comes from the DataExplorer package
library(DataExplorer)
plot_histogram(data)
# Time.to.Charge.EV..hours.at.120v. does not appear because all observations = 0

As the majority of models in our dataset are neither electric vehicles (EVs) nor hybrid cars, and because some columns concern only these two types of vehicles, the results show numerous zero values in several columns. This issue will be addressed during the data cleaning process. Additionally, certain features, such as “Engine Cylinders”, are numerically discrete, as illustrated in the corresponding plot.

3.0.4 Outliers Detection

In order to identify occurrences that deviate significantly from the rest of the observations, and to potentially improve the overall quality of the data, we decided to analyse outliers using boxplots. Here are the results on the numerical columns of the dataset:

Show the code
#tentative boxplots

data_long <- data %>%
  select_if(is.numeric) %>%
  pivot_longer(cols = c("ID", 
                        "model_year", 
                        "estimated_Annual_Petrolum_Consumption_Barrels", "City_MPG_Fuel_Type_1",
                        "highway_mpg_fuel_type_1", 
                        "combined_MPG_Fuel_Type_1", 
                        "City_MPG_Fuel_Type_2", 
                        "highway_mpg_fuel_type_2", 
                        "combined_MPG_Fuel_Type_2", 
                        "time_to_Charge_EV_hours_at_120v_", 
                        "charge_time_240v", 
                        "range_for_EV", 
                        "range_ev_city_fuel_type_1", 
                        "range_ev_city_fuel_type_2", 
                        "range_ev_highway_fuel_type_1", 
                        "range_ev_highway_fuel_type_2"), names_to = "variable", values_to = "value")

ggplot(data_long, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot(outlier.size = 0.5) +  # Make outlier points smaller
  facet_wrap(~ variable, scales = "free_y") +  # Each variable gets its own y-axis
  theme_minimal() +
  theme(legend.position = "none",  # Hide the legend
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 0),strip.text = element_text(size = 7)) +  # Rotate x-axis labels
  labs(title = "Outlier Detection Boxplots", x = "", y = "Values")

Most of our boxplots show extreme values. Again, this is due to the small number of EV and hybrid cars in our dataset compared to the rest of the models, and to the nature of some features that concern only those types of vehicles.

3.0.5 Number of models per make

Now let’s check how many models per make we have in our dataset. In order to have a clear plot, we decided to keep only the top 20 brands on the graph. All the remaining makes are accessible in the table just below.

Show the code
#Number of occurences/model per make 
nb_model_per_make <- data %>%
  group_by(make, model) %>%
  summarise(Number = n(), .groups = 'drop') %>%
  group_by(make) %>%
  summarise(Models_Per_Make = n(), .groups = 'drop') %>%
  arrange(desc(Models_Per_Make))

#makes with only 1 model 
only_one_model_cars <- nb_model_per_make %>% 
  filter(Models_Per_Make == 1 )

#table globale
datatable(nb_model_per_make,
          rownames = FALSE,
          options = list(pageLength = 5,
                         class = "hover",
                         searchHighlight = TRUE))
Show the code
# Option to limit to top 20 makes for better readability
top_n_makes <- nb_model_per_make %>% top_n(20, Models_Per_Make)

#plot
ggplot(top_n_makes, aes(x = reorder(make, Models_Per_Make), y = Models_Per_Make)) +
  geom_bar(stat = "identity", color = "black", fill = "grey", show.legend = FALSE) +
  labs(title = "Models per Make (Top 20)",
       x = "Make",
       y = "Number of Models") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        axis.text.y = element_text(hjust = 1, size = 10),
        plot.title = element_text(size = 14)) +
  coord_flip()  # Flip coordinates for better readability

Of the 141 brands, we notice that only 13 have more than 100 models in the dataset. Among these, only two (Mercedes-Benz and BMW) have more than 400 models present. Let’s now check the makes with a single model in the dataset.

Show the code
#table only 1 model
datatable(only_one_model_cars,
          rownames = FALSE,
          options = list(pageLength = 5,
                         class = "hover",
                         searchHighlight = TRUE))
Show the code
ggplot(only_one_model_cars, aes(x = make, y = Models_Per_Make)) +
  geom_bar(stat = "identity", color = "black", fill = "grey", show.legend = FALSE) +
  labs(title = "Makes with only 1 model in the dataset",
       x = "Make",
       y = "Number of Models") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        axis.text.y = element_text(hjust = 1, size = 10),
        plot.title = element_text(size = 14)) +
  coord_flip()  # Flip coordinates for better readability

Therefore, we can see that Mercedes-Benz and BMW have significantly more models in our dataset, which means that we are dealing with imbalanced categories. We need to be careful when doing predictions, as we may encounter bias toward these two majority classes. A few techniques can be used to deal with this problem, such as resampling, ensemble methods (random forests, boosting) or tuning the probability threshold.
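One of these techniques, random oversampling, can be sketched in a few lines of Python on hypothetical toy labels (the brand names and counts below are placeholders, not our actual class counts):

```python
import random
from collections import Counter

random.seed(42)

# Hypothetical toy labels standing in for the 'make' column
labels = ["BMW"] * 10 + ["Fisker"] * 2

counts = Counter(labels)
target = max(counts.values())  # grow every class to the majority-class size

balanced = []
for make, n in counts.items():
    members = [m for m in labels if m == make]
    balanced.extend(members)
    # resample the class with replacement until it reaches the target size
    balanced.extend(random.choices(members, k=target - n))
```

After this step, every make appears as often as the majority make, at the cost of duplicating observations from the rare classes.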


4 Data cleaning

In this section we will handle the missing values of our dataset to make sure that we have a clean dataset for our modeling. We will first visualize the missing values and then clean them in the columns that we will use for our analysis. We will also remove some rows and columns that are not relevant for our analysis.

Let’s have a look at the entire dataset and its missing values in grey.

We can see that overall, we do not have many missing values in proportion with the size of our dataset. However, we can see that some columns have a lot of missing values. Below we have the detail of the percentage of missing values by columns.

4.1 Dealing with the columns Engine Cylinders and Engine Displacement

As we can see, we have missing values in six columns. Let’s first have a closer look at the engine cylinders and engine displacement columns. They both have 484 missing values. After some data manipulation, we see that these 484 missing values all correspond to electric vehicles, and that the dataset contains exactly 484 electric vehicles, so the missing values in these columns concern only electric vehicles. This makes sense since electric vehicles do not have a combustion engine, and therefore these categories are not applicable. We will therefore replace all missing values in these two columns with “none”.

As we can see, we still have some missing values in the columns “Fuel Type 2”, “Engine Description”, “Drive” and “Transmission”. Let’s investigate the missing values in the column “Drive”.

4.2 Dealing with the column Drive, Transmission and Engine Description

We decided to drop the brands with more than 10% of missing values in the “Drive” column. After this operation, we also removed the 8 remaining observations with missing values in the “Transmission” column. Finally, we dropped the column “Engine Description” since it contains missing values for more than a third of our observations.
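The brand-level filtering described above can be sketched in plain Python on a hypothetical toy table (the makes and drive values are placeholders for illustration only):

```python
# Hypothetical toy rows standing in for the dataset; None marks a missing Drive
rows = [
    {"make": "A", "drive": "FWD"}, {"make": "A", "drive": None},   # A: 50% missing
    {"make": "B", "drive": "RWD"}, {"make": "B", "drive": "AWD"},  # B: 0% missing
]

# Share of missing Drive values per make
makes = {r["make"] for r in rows}
missing_share = {
    m: sum(r["drive"] is None for r in rows if r["make"] == m)
       / sum(r["make"] == m for r in rows)
    for m in makes
}

# Keep only the makes whose missing share is at most 10%
kept = [r for r in rows if missing_share[r["make"]] <= 0.10]
```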

4.3 Final dataset

The dataset is now clean and does not contain any missing values. It contains 42,240 observations, 18 features and 129 brands. We renamed the columns and stored it in a csv file (data_cleaned.csv). However, for some models, we need to tackle the unbalanced classes in the target variable. For this reason we also created a new csv file in which we drop the makes with fewer than 10 models (data_cleaned_reduced.csv). This dataset contains 42,061 observations, 18 features and 66 brands.

Here are the two cleaned datasets on which we are working on from now on.

Cleaned Dataset

Name Number_of_rows Number_of_columns Character Numeric Group_variables
data_cleaned 42240 18 8 5 None

Cleaned and Reduced Dataset

Name Number_of_rows Number_of_columns Character Numeric Group_variables
data_cleaned_reduced 42061 18 8 5 None

5 Classification Tree

In this section we perform a classification tree analysis on the dataset. We first load the necessary packages and the dataset. We then prepare the data by encoding categorical variables and split it into training and testing sets. Finally, we prune the tree with different max_depth values to find the optimal tree depth that balances training and test accuracy.

5.0.1 Loading R packages, Python libraries and Data Preparation

After importing all the necessary libraries and packages, we loaded the dataset and identified make as the target variable. We also encoded categorical variables using label encoding to convert them into numerical values.

We then split the dataset into training (80%) and testing (20%) sets. This step is essential for preventing overfitting: we select the best model configuration on the training set, then assess the final model’s performance on the “unseen data” of the testing set. This helps to build a robust model that generalizes well to new data. Unfortunately, we will see later that our model overfits, meaning that it performs poorly on new data; we will address this issue later. Below is a plot showing the actual data split: the x axis represents the number of observations in our dataset and the y axis the number of makes.

5.0.2 Training the Decision Tree without Pruning

Starting to train our Decision Tree, we have first decided to do it without any constraints and see the results of it.

DecisionTreeClassifier(random_state=42)
Training Accuracy (without max_depth): 0.8939
Test Accuracy (without max_depth): 0.6979

As we can see, the tree resulting from our training is very complex, takes a long time to train and is not interpretable at all. Moreover, the accuracy on the training set is much higher than on the test set, showing that our model is overfitting.

5.0.3 Training the Decision Tree with Pruning method

In order to obtain a less complex tree and to fight overfitting, we decided to prune our tree by constraining its depth during training. We tried a few max_depth parameter values to control the tree’s growth (none, 5, 10, 15, 20, 25, 30), looking for the optimal tree depth that balances training and test accuracy. “None” corresponds to training without any max_depth constraint. Here are our new results:

max_depth   Training Accuracy   Test Accuracy
5       0.2605      0.2550
10      0.4887      0.4674
15      0.7203      0.6352
20      0.8520      0.6926
25      0.8898      0.6972
30      0.8939      0.6971
None        0.8939      0.6979

The model’s accuracy improved as the tree’s depth increased, up to a point: a max_depth of 25 or 30 brings the test accuracy close to the 69.79% of the unpruned tree. Reducing the max_depth to 10 or 15 strongly narrows the gap between training and test accuracy, and therefore drastically reduces overfitting, but at the expense of accuracy on new data. Pruning with a max_depth of 25 keeps the test accuracy almost unchanged (69.72% against 69.79%) while slightly reducing the gap between the training and test sets. In our case, pruning the decision tree helps improve its generalization performance by preventing it from becoming too complex and by reducing overfitting on the training data.
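Assuming scikit-learn is available, the pruning experiment above can be sketched on a small synthetic stand-in dataset (our real vehicle features and make target are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the vehicle data: 1000 samples, 10 features, 4 classes
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train one tree per max_depth value; None means unrestricted growth
results = {}
for depth in (5, 10, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train),  # training accuracy
                      tree.score(X_test, y_test))    # test accuracy
```

Comparing the two accuracies per depth makes the overfitting pattern visible: the unrestricted tree fits the training set best, while a moderate depth narrows the train/test gap.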

6 Neural Network

In this section, we will build a neural network model to predict the make of a car based on the features at our disposal. We will preprocess the data, split it into training and testing sets, define the neural network architecture, compile the model, train it and evaluate its performance.

6.1 Preprocessing and splitting the data

The dataset contains different types of data. Some columns are numerical (like “city_mpg_fuel_type_1” or “charge_time_240v”) and some are categorical (“vehicle_class” or “fuel_type”). We identify and separate these two types of columns. Separating numerical and categorical columns is an essential step in data preprocessing because they require different handling to prepare them for machine learning algorithms. The numerical columns need to be scaled so that they have a mean of zero and a standard deviation of one, which helps the machine learning algorithm perform better, while the categorical columns need to be one-hot encoded, which turns each category into a binary column, a format that the machine learning model can understand.
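As a minimal illustration of these two preprocessing steps, here is a plain-Python sketch on hypothetical column values (the actual pipeline uses library transformers, but the arithmetic is the same):

```python
from statistics import mean, pstdev

# Hypothetical values for one numeric and one categorical column
city_mpg = [18.0, 24.0, 30.0, 21.0]
fuel_type = ["Regular", "Premium", "Electricity", "Regular"]

# Scale the numeric column to zero mean and unit standard deviation
mu, sigma = mean(city_mpg), pstdev(city_mpg)
city_mpg_scaled = [(x - mu) / sigma for x in city_mpg]

# One-hot encode the categorical column: one binary column per category
categories = sorted(set(fuel_type))
fuel_onehot = [[1 if v == c else 0 for c in categories] for v in fuel_type]
```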

The data is split into two parts: training and testing. The training set is used to train the model, and the testing set is used to evaluate its performance. This split ensures that we can test how well the model generalizes to new, unseen data.

6.1.1 Building the neural network models and training them

6.1.1.1 Base Neural Network

We chose to use a neural network. This neural network consists of layers of neurons, where each layer applies transformations to the data. The first layer, the input layer, takes the input features. Hidden layers then help the model learn complex patterns, and the output layer predicts the probability of each car manufacturer. The second layer is set to 128 neurons, the third to 64 neurons, and the output layer has as many neurons as there are car manufacturers. The activation function used in the hidden layers is the Rectified Linear Unit (ReLU), and the output layer uses the Softmax activation function. The model is compiled with the Adam optimizer and the categorical crossentropy loss function.

# Define the neural network model
model_no_dropout = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(y_train.shape[1], activation='softmax')])

We used activation functions in the hidden layers to introduce non-linearity into the model. The ReLU activation function is used in the hidden layers because it is computationally efficient and helps the model learn complex patterns in the data. The Softmax activation function is used in the output layer because it converts the model’s raw output into probabilities that sum to one. This allows us to interpret the model’s output as the probability of each car manufacturer.
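The Softmax function itself can be sketched in a few lines of Python; the logits below are hypothetical raw outputs for a three-brand problem:

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]         # probabilities summing to one

# Hypothetical raw outputs for a three-brand problem
probs = softmax([2.0, 1.0, 0.1])
```

The largest logit yields the largest probability, and the outputs always sum to one, which is what lets us read them as class probabilities.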

We used the following hyperparameters (non-exhaustive list):

  • epochs: 5 (Corresponds to the number of times the model sees the entire dataset during training.)
  • batch_size: 32 (Corresponds to the number of samples that the model processes before updating the weights.)
  • validation_split: 0.2 (Corresponds to the fraction of the training data to be used as validation data.)

The model is trained for 5 epochs with a batch size of 32. The validation split is set to 0.2, which means that 20% of the training data is used for validation.

Overall, the model performs well, but we haven’t dealt with the issue of unbalanced classes yet. Let’s have a look at the distribution of the sensitivity and specificity for each class.

6.1.1.1.1 Issue of unbalanced classes

The issue of unbalanced classes, as explained previously, can greatly weaken the model’s ability to generalize to new data: the model will automatically prefer to predict the most frequent classes. We can see in the boxplots below the distribution of the sensitivity and specificity across the classes. Even though we already dealt in part with the unbalanced classes during the cleaning process, there are still big differences between the classes.

(Boxplots of the distribution of sensitivity and specificity across classes.)

By examining the boxplot representing the distribution of sensitivity and specificity across the classes, we observe clear evidence of class imbalance. Sensitivity and specificity are not consistent across the classes. Specificity, which measures how well the model identifies true negatives, is very high for every class. This indicates that the model is effective at detecting instances that do not belong to a given class. However, sensitivity, which measures the true positive rate and reflects how well the model correctly predicts the make of a car for a specific brand, is not as high. This suggests that the model is not performing well for all classes.
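For reference, per-class sensitivity and specificity can be computed from predictions as in this sketch (the labels below are hypothetical, with only two makes to keep the example small):

```python
def per_class_rates(y_true, y_pred, cls):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
    fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
    tn = sum(t != cls and p != cls for t, p in pairs)  # true negatives
    fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
    sensitivity = tp / (tp + fn)  # true positive rate for this class
    specificity = tn / (tn + fp)  # true negative rate for this class
    return sensitivity, specificity

# Hypothetical labels
y_true = ["BMW", "BMW", "Ford", "Ford", "Ford"]
y_pred = ["BMW", "Ford", "Ford", "Ford", "BMW"]
sens, spec = per_class_rates(y_true, y_pred, "BMW")
```

Repeating this per class and collecting the results is what produces the distributions shown in the boxplots.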

For some classes with more vehicle models, the model tends to predict those classes more frequently, leading to higher accuracy but lower sensitivity for rarer classes. To address this issue, we will use class weights to ensure the model performs more evenly across all classes.

6.1.1.1.2 Adding class weights to the model

This technique has been explained in the methods section but in a nutshell, it is a way to penalize the model for misclassifying the minority class more than the majority class. This helps the model to learn the patterns in the minority class better and improve its performance on the test set.
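A common way to build such weights, sketched here on hypothetical class counts, is the “balanced” inverse-frequency rule n_samples / (n_classes × class_count), so rare makes get proportionally larger weights:

```python
from collections import Counter

# Hypothetical class counts standing in for the 'make' target
y_train = ["BMW"] * 400 + ["Ford"] * 100 + ["Fisker"] * 10

counts = Counter(y_train)
n, k = len(y_train), len(counts)

# 'Balanced' weights: n_samples / (n_classes * class_count)
class_weight = {make: n / (k * c) for make, c in counts.items()}
```

The resulting dictionary can be passed to the training routine so that misclassifying a rare make costs more than misclassifying a frequent one.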

(Boxplots of sensitivity and specificity across classes, after adding class weights.)

As we can see, the model takes better care of the minority classes, and overall the sensitivity is higher across the classes. Sensitivity and specificity are more consistent across the classes, and the model is better at generalizing to new data. In our case, this method does not completely eliminate the issue of unbalanced classes; given the structure of our data and the discrepancy between our classes, we will keep this technique for the following neural networks and move on.

6.1.1.1.3 Model performance

We can now look at the evolution of the accuracy of our model during the training process with the following plot.

As we can see, at each epoch, the accuracy is increasing and the loss is decreasing. The model is learning from the training data and improving its predictions.

But, in the end, we have a case of overfitting. The model performs well on the training data but not as well on the testing data. This is an issue because it limits the possibility of generalizing the model to new data.

Performance of the model with weighted class:

Final Training Accuracy: 77.31%
Final Validation Accuracy: 72.88%
Test Set Accuracy: 72.33%

Overall, the performance of the model is still good, but it can be improved. To address the issue of overfitting, we will introduce Dropout layers in the neural network.

6.1.1.2 Neural Network with Dropout layers

Dropout layers randomly set a fraction of input units to zero during training, which helps prevent overfitting by forcing the model to learn more robust features. We will tune the dropout rate to find the value that balances training and validation accuracy while reducing overfitting.
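The mechanism can be sketched in plain Python as “inverted dropout”, where surviving activations are rescaled so the expected output is unchanged (a simplification of what Keras Dropout layers do during training):

```python
import random

random.seed(0)

def dropout(activations, rate):
    # Drop each unit with probability `rate` and rescale survivors by
    # 1/(1 - rate) so the expected activation is unchanged
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]

layer_output = [0.5] * 1000          # hypothetical hidden-layer activations
dropped = dropout(layer_output, rate=0.2)
```

With a rate of 0.2, roughly a fifth of the units are zeroed on each training pass, so the network cannot rely on any single neuron.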

We used the following hyperparameters (non-exhaustive list):

  • epochs: 5 (Corresponds to the number of times the model sees the entire dataset during training.)
  • batch_size: 32 (Corresponds to the number of samples that the model processes before updating the weights.)
  • validation_split: 0.2 (Corresponds to the fraction of the training data to be used as validation data.)
  • dropout_rate: varies (Corresponds to the fraction of neurons to drop during training.)

We will try 5 different dropout rates in addition to the case without dropout. We will train the model with each dropout rate and evaluate its performance on the validation and test sets. We will then plot the training, validation, and test accuracies for each dropout rate to find the optimal value.

# Function to create and compile the model
def create_model(dropout_rate=0.0):
    model = Sequential([
        Input(shape=(X_train.shape[1],)),
        Dense(128, activation='relu'),
        Dropout(dropout_rate),
        Dense(64, activation='relu'),
        Dropout(dropout_rate),
        Dense(y_train.shape[1], activation='softmax')
    ])
    model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model

We can see that the model with a dropout rate of 0.2 has the best performance on the test set. This model has a good balance between training and validation accuracy, it generalizes well to new data, and it eliminates the overfitting issue. We will use this dropout rate of 0.2 to train the final model with dropout layers.

We see that the model with dropout layers generalizes better than the one without: the validation accuracy is now above the training accuracy, and the model is no longer overfitting.

Performance of the model with weighted class and dropout layers:

Final Training Accuracy: 62.43%
Final Validation Accuracy: 65.74%
Test Set Accuracy: 65.59%

The final accuracy of our model is not as high as with our first model, but the model is at least better at representing the data and at generalizing to new data.

Show the code
source(here::here("scripts","setup.R"))
library(data.table)
data_cleaned <- fread(here::here("data", "data_cleaned.csv"))

In order to see the links between the features, we can use a dimension reduction technique such as Principal Component Analysis, which relates features according to their similarities across instances and combines them into fewer dimensions.

7 Principal Component Analysis

7.1 Data Standardization

Show the code
data_prepared <- data_cleaned %>%
  mutate(across(where(is.character), as.factor)) %>%
  mutate(across(where(is.factor), as.numeric)) %>%
  scale()  # Standardizes numeric data including converted factors

We begin by standardizing our data: we transform the character variables into factors, convert those factors into numeric codes, and then scale all columns to mean 0 and standard deviation 1.
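The scaling step can be sketched in plain Python (the small column of values is made up; R's `scale()` uses the sample standard deviation, i.e. an n − 1 denominator):

```python
def zscore(column):
    """Standardize a numeric column to mean 0 and standard deviation 1,
    using the sample standard deviation (n - 1), as R's scale() does."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((x - mean) ** 2 for x in column) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in column]

# Illustrative column of values
standardized = zscore([2.0, 4.0, 6.0, 8.0])
```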

7.2 Heatmap

Show the code
cor_matrix <- cor(data_prepared)  # Calculate correlation matrix

# Melt Correlation Matrix
melted_cor_matrix <- melt(cor_matrix)

# Heatmap
ggplot(melted_cor_matrix, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") + 
  geom_text(aes(label = sprintf("%.2f", value)), color = "black", size = 3.5) +  
  scale_fill_gradient2(low = "lightblue", high = "darkblue", mid = "blue", midpoint = 0, limit = c(-1,1),
                       name = "Pearson\nCorrelation") +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5),  
        plot.title.position = "plot") +
  labs(x = 'Variables', y = 'Variables', 
       title = 'Correlations Heatmap of Features')  

We used this heatmap to check the correlations between the variables. As we can see, a few variables are strongly correlated, but most of them are not, whether positively or negatively. Let’s now look into the links between the features using a biplot, which displays the observations together with the features.
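For reference, each cell of the heatmap is a pairwise correlation coefficient. A minimal pure-Python Pearson correlation (the two short series are invented, loosely mimicking city and highway mpg, which tend to move together):

```python
def pearson(x, y):
    """Pearson correlation: covariance divided by the product
    of the standard deviations of the two series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Illustrative values: city mpg vs. highway mpg for four cars
r = pearson([18, 22, 25, 30], [24, 29, 33, 38])
```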

7.3 Biplot

Show the code
pca_results <- PCA(data_prepared, graph = FALSE)
summary(pca_results)

Call:
PCA(X = data_prepared, graph = FALSE) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
Variance               4.616   3.735   2.126   1.292   1.019   0.988   0.856
% of var.             25.644  20.748  11.809   7.177   5.660   5.490   4.753
Cumulative % of var.  25.644  46.392  58.201  65.378  71.038  76.527  81.280
                       Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
Variance               0.827   0.765   0.549   0.510   0.349   0.193   0.137
% of var.              4.594   4.250   3.047   2.834   1.941   1.071   0.759
Cumulative % of var.  85.875  90.124  93.172  96.005  97.946  99.017  99.777
                      Dim.15  Dim.16  Dim.17  Dim.18
Variance               0.027   0.008   0.003   0.002
% of var.              0.150   0.045   0.018   0.011
Cumulative % of var.  99.926  99.971  99.989 100.000

Individuals (the 10 first)
                                 Dist    Dim.1    ctr   cos2    Dim.2    ctr
1                            |  3.334 | -1.087  0.001  0.106 |  0.065  0.000
2                            |  3.387 | -0.907  0.000  0.072 | -0.079  0.000
3                            |  3.522 | -1.153  0.001  0.107 | -0.031  0.000
4                            |  2.761 | -1.068  0.001  0.150 |  0.012  0.000
5                            |  2.713 | -0.991  0.001  0.133 | -0.022  0.000
6                            |  2.713 | -0.991  0.001  0.133 | -0.022  0.000
7                            |  2.847 | -1.039  0.001  0.133 | -0.066  0.000
8                            |  2.876 | -1.151  0.001  0.160 | -0.019  0.000
9                            |  4.850 | -1.810  0.002  0.139 |  0.346  0.000
10                           |  3.313 | -1.686  0.001  0.259 |  0.231  0.000
                               cos2    Dim.3    ctr   cos2  
1                             0.000 |  0.018  0.000  0.000 |
2                             0.001 |  2.589  0.007  0.584 |
3                             0.000 |  2.603  0.008  0.546 |
4                             0.000 |  0.658  0.000  0.057 |
5                             0.000 |  0.609  0.000  0.050 |
6                             0.000 |  0.609  0.000  0.050 |
7                             0.001 |  0.490  0.000  0.030 |
8                             0.000 |  0.546  0.000  0.036 |
9                             0.005 | -0.005  0.000  0.000 |
10                            0.005 |  1.551  0.003  0.219 |

Variables (the 10 first)
                                Dim.1    ctr   cos2    Dim.2    ctr   cos2  
make                         |  0.106  0.242  0.011 | -0.092  0.229  0.009 |
model_year                   |  0.347  2.605  0.120 |  0.032  0.028  0.001 |
vehicle_class                | -0.141  0.428  0.020 |  0.051  0.071  0.003 |
drive                        | -0.038  0.031  0.001 |  0.003  0.000  0.000 |
engine_cylinders             |  0.077  0.130  0.006 | -0.073  0.141  0.005 |
engine_displacement          |  0.125  0.338  0.016 | -0.104  0.292  0.011 |
transmission                 | -0.498  5.376  0.248 |  0.069  0.128  0.005 |
fuel_type_1                  | -0.419  3.802  0.175 |  0.223  1.328  0.050 |
city_mpg_fuel_type_1         |  0.837 15.173  0.700 | -0.308  2.532  0.095 |
highway_mpg_fuel_type_1      |  0.800 13.860  0.640 | -0.313  2.630  0.098 |
                              Dim.3    ctr   cos2  
make                         -0.472 10.475  0.223 |
model_year                   -0.120  0.679  0.014 |
vehicle_class                 0.474 10.568  0.225 |
drive                         0.158  1.172  0.025 |
engine_cylinders              0.755 26.797  0.570 |
engine_displacement           0.851 34.040  0.724 |
transmission                 -0.018  0.016  0.000 |
fuel_type_1                  -0.170  1.358  0.029 |
city_mpg_fuel_type_1         -0.242  2.753  0.059 |
highway_mpg_fuel_type_1      -0.351  5.790  0.123 |
Show the code
# Biplot Graph
fviz_pca_biplot(pca_results,
                geom.ind = "point",  
                geom.var = c("arrow", "text"), 
                col.ind = "cos2",  
                gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),  
                repel = TRUE 
)

The biplot shows several pieces of information. First, the two dimensions explain almost 50% of the total variance of the data. Each point represents an observation, and its color represents the quality of its representation (cos2): the redder the dot, the better it is represented by the first two dimensions. The arrows (or vectors) represent the features. Vectors pointing in similar directions indicate positive correlations, vectors pointing in opposite directions indicate negative correlations, and orthogonal vectors indicate uncorrelated variables.

Taking all of this into account, we can interpret the graph as follows: the variables linking the mpg and the range for a given fuel type (e.g. fuel type 1) point in the same direction and seem positively correlated with each other, while being uncorrelated with the same characteristics of the other fuel type (e.g. fuel type 2). The mpg and range also seem negatively correlated with their own fuel-type variable. Moreover, fuel type 1 is uncorrelated with fuel type 2, which makes sense. Dimension 1 thus seems to represent the “non-hybrid” characteristics of vehicles, whereas dimension 2 is harder to interpret from this graph alone; it can potentially be linked to the “hybrid” characteristics of vehicles.

Also, looking at the biplot, there seem to be 3 distinct “clusters” among the observations. We will therefore proceed with a cluster analysis to verify this.

7.4 Screeplot

Show the code
# Use PCA results to generate the screeplot
fviz_eig(pca_results, 
         addlabels = TRUE,  
         ylim = c(0, 100),  
         barfill = "lightblue",  
         barcolor = "black",  
         main = "Scree Plot of PCA") 

Looking at the screeplot, 7 dimensions are needed to reach at least 80% of the variance, suggesting that the features are relatively independent. The biplot above already hinted at this: most arrows near the middle are short and their cos2 values are low, meaning that those features are probably better represented by dimensions beyond the first two.
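This reading of the screeplot can be checked numerically against the eigenvalues reported in the PCA summary above, where the cumulative percentage of variance first exceeds 80% at dimension 7:

```python
def dims_for_variance(percentages, threshold=80.0):
    """Smallest number of leading dimensions whose cumulative
    percentage of variance reaches the threshold."""
    cum = 0.0
    for i, p in enumerate(percentages, start=1):
        cum += p
        if cum >= threshold:
            return i
    return len(percentages)

# % of variance per dimension, copied from the PCA summary above
var_pct = [25.644, 20.748, 11.809, 7.177, 5.660, 5.490, 4.753,
           4.594, 4.250, 3.047, 2.834, 1.941, 1.071, 0.759,
           0.150, 0.045, 0.018, 0.011]
print(dims_for_variance(var_pct))  # 7
```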

8 Clustering

8.1 Cluster Selection

Show the code
# Get PCA coordinates
pca_coords_df <- data.frame(pca_results$ind$coord)
set.seed(123) # for reproducibility

# Use Elbow Method
wcss <- vector()
for (i in 1:10) {
  km <- kmeans(pca_coords_df[, 1:3], centers = i)
  wcss[i] <- sum(km$withinss)
}

# Elbow Plot
elbow_data <- data.frame(Clusters = 1:10, WCSS = wcss)
elbow_plot <- ggplot(elbow_data, aes(x = Clusters, y = WCSS)) +
  geom_point() +
  geom_line() +
  geom_text(aes(label = Clusters), vjust = -0.5) +
  ggtitle("Elbow Method for Optimality") +
  xlab("Number of Clusters") +
  ylab("Within-Cluster Sum of Squares (WCSS)")

# Print the elbow plot
elbow_plot

Show the code
# Looking at Elbow, Optimal nb of clusters = 3
optimal_clusters <- 3
km_result <- kmeans(pca_coords_df[, 1:3], centers = optimal_clusters) # KMeans

# Add cluster assignments to PCA coordinates
pca_coords_df$cluster <- km_result$cluster

# Cluster Plot
cluster_plot <- ggplot(pca_coords_df, aes(x = Dim.1, y = Dim.2, color = factor(cluster))) +
  geom_point() +
  ggtitle("Cluster") +
  xlab(paste("Dim1 (", round(pca_results$eig[1, 2], 1), "%)", sep = "")) +
  ylab(paste("Dim2 (", round(pca_results$eig[2, 2], 1), "%)", sep = "")) +
  scale_color_manual(values = c("lightblue", "lightpink", "lightgreen"))

cluster_plot

The elbow method shows us the optimal number of clusters. In our case there seem to be 2 elbows; the first one indicates an optimum at 3 clusters. We will work with 3 clusters and see what the second elbow represents. We then look at the cluster plot, which shows the 3 clusters. We will go deeper into the interpretation by looking at the 3D version of the biplot, which combines the PCA and the clustering.
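The elbow can also be located numerically as the k where the decrease in WCSS slows most sharply, i.e. the largest second difference of the WCSS curve. The WCSS values below are invented but shaped like a typical elbow curve:

```python
def elbow_k(wcss):
    """Pick k where the drop in WCSS slows down the most:
    argmax of (wcss[k-1] - wcss[k]) - (wcss[k] - wcss[k+1]),
    with wcss[i] the WCSS for k = i + 1 clusters."""
    best_k, best_gain = None, float("-inf")
    for i in range(1, len(wcss) - 1):
        gain = (wcss[i - 1] - wcss[i]) - (wcss[i] - wcss[i + 1])
        if gain > best_gain:
            best_k, best_gain = i + 1, gain  # i is 0-based, k is 1-based
    return best_k

# Illustrative WCSS values for k = 1..6 (not our actual values)
print(elbow_k([100.0, 60.0, 30.0, 25.0, 22.0, 20.0]))  # 3
```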

8.2 3D Biplot

Show the code
# Calculate Cluster Centers
cluster_centers <- aggregate(pca_coords_df[, 1:3], by = list(cluster = pca_coords_df$cluster), FUN = mean)

# Get loadings for variables
loadings <- data.frame(
  variable = rownames(pca_results$var$coord),
  pca_results$var$coord[, 1:3]
)

# Create the 3D scatter plot with clusters and loadings
fig <- plot_ly() %>%
  add_trace(
    data = pca_coords_df,
    x = ~Dim.1,
    y = ~Dim.2,
    z = ~Dim.3,
    color = ~factor(km_result$cluster),
    colors = c("lightblue", "lightpink", "lightgreen"),
    type = 'scatter3d',
    mode = 'markers',
    marker = list(size = 3)
  ) %>%
  add_trace(
    data = cluster_centers,
    x = ~Dim.1,
    y = ~Dim.2,
    z = ~Dim.3,
    text = ~paste("Cluster", cluster),
    type = 'scatter3d',
    mode = 'text',
    textposition = 'top center',
    textfont = list(color = 'black', size = 10)
  )

# Scale factor for loadings arrows
scale.loads <- 10.0

# Add loadings as arrows
for (k in 1:nrow(loadings)) {
  fig <- fig %>%
    add_trace(
      x = c(0, loadings$Dim.1[k]) * scale.loads,
      y = c(0, loadings$Dim.2[k]) * scale.loads,
      z = c(0, loadings$Dim.3[k]) * scale.loads,
      type = 'scatter3d',
      mode = 'lines+markers',
      line = list(width = 4, color = 'blue'),
      marker = list(size = 2, color = 'blue'),
      showlegend = FALSE
    ) %>%
    add_trace(
      x = loadings$Dim.1[k] * scale.loads,
      y = loadings$Dim.2[k] * scale.loads,
      z = loadings$Dim.3[k] * scale.loads,
      text = loadings$variable[k],
      type = 'scatter3d',
      mode = 'text',
      textposition = 'top center',
      textfont = list(color = 'blue', size = 10),
      showlegend = FALSE
    )
}

# Layout
fig <- fig %>%
  layout(
    title = "PCA - 3D Biplot with Features",
    scene = list(
      xaxis = list(title = paste("Dim1 (", round(pca_results$eig[1, 2], 1), "%)", sep = "")),
      yaxis = list(title = paste("Dim2 (", round(pca_results$eig[2, 2], 1), "%)", sep = "")),
      zaxis = list(title = paste("Dim3 (", round(pca_results$eig[3, 2], 1), "%)", sep = ""))
    )
  )

# Display the plot
fig

Looking at this 3D biplot, we can clearly see the 3 different clusters. Cluster 1 seems to group the observations driven by the characteristics of fuel type 2 vehicles, i.e. hybrid cars. Cluster 2 links observations through general car characteristics such as vehicle class, engine displacement and cylinders, drive, make, fuel types, and model year. Finally, cluster 3 groups the characteristics of fuel type 1 vehicles, i.e. conventional gasoline cars.

8.3 Silhouette Plot

Show the code
# Set seed for reproducibility
set.seed(123)

# Sample a subset of the data if needed
sample_indices <- sample(1:nrow(pca_coords_df), 30000) # Adjust sample size as needed
sampled_data <- pca_coords_df[sample_indices, 1:3]

# Perform KMeans clustering
optimal_clusters <- 3 # Replace with your optimal number of clusters
kmeans_result <- kmeans(sampled_data, centers = optimal_clusters)

# Create a data frame with the clustering results
clustered_data <- data.frame(sampled_data, cluster = kmeans_result$cluster)

# Define the number of samples per cluster
samples_per_cluster <- min(table(clustered_data$cluster))

# Sample equal number of points from each cluster
equal_sampled_data <- do.call(rbind, lapply(1:optimal_clusters, function(cluster_num) {
  cluster_subset <- clustered_data[clustered_data$cluster == cluster_num, ]
  cluster_sample_indices <- sample(1:nrow(cluster_subset), samples_per_cluster)
  return(cluster_subset[cluster_sample_indices, ])
}))

# Calculate silhouette values for the sampled data
equal_sampled_silhouette_values <- silhouette(equal_sampled_data$cluster, dist(equal_sampled_data[, 1:3]))

# Define the colors for each cluster
colors <- c("lightblue", "lightpink", "lightgreen")

# Plot silhouette values for the sampled data with custom colors
fviz_silhouette(equal_sampled_silhouette_values, palette = colors) +
  ggtitle("Silhouette Plot for Sampled KMeans Clustering") +
  xlab("Cluster") +
  ylab("Silhouette Width")

  cluster size ave.sil.width
1       1  517          0.54
2       2  517          0.12
3       3  517          0.41

In order to produce a silhouette plot, we took only a sample of the dataset, as it would have been too heavy to run otherwise. Clusters 1 and 3 both show high silhouette widths, indicating that they are well clustered; their widths are mostly above the average width (the red line), meaning good consistency within each cluster. On the other hand, there is more variation in the 2nd cluster, indicating some heterogeneity: it contains some negative silhouette widths and many observations below the red line, meaning that some observations might fit better in another cluster. For simplification purposes, we decided to stay with 3 clusters; a 3D biplot with 6 clusters is provided in the annex.qmd.
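For completeness, the silhouette width of a single point is (b − a) / max(a, b), where a is its mean distance to the other points of its own cluster and b its mean distance to the points of the nearest other cluster; negative values mean the point is closer to the other cluster. A minimal 1-D sketch with invented points:

```python
def silhouette_width(point, own_cluster, other_cluster):
    """Silhouette of one point: (b - a) / max(a, b), using mean
    absolute distance in 1-D for simplicity. `own_cluster` holds
    the other members of the point's cluster."""
    a = sum(abs(point - q) for q in own_cluster) / len(own_cluster)
    b = sum(abs(point - q) for q in other_cluster) / len(other_cluster)
    return (b - a) / max(a, b)

# A point sitting well inside its own cluster scores close to 1
s = silhouette_width(1.0, [0.8, 1.2], [9.0, 10.0, 11.0])
```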

To sum up the unsupervised learning part, some factors are clearly linked together and the observations can be grouped into 3 clusters. That these clusters correspond to “Hybrid Vehicles”, “Vehicle Characteristics” and “Non-Hybrid Vehicles” makes sense: cars either have a single propulsion type or are hybrid, and vehicle characteristics form a separate cluster because both hybrid and non-hybrid vehicles can share some of the same characteristics (though not all). As for the features, 7 dimensions are needed to explain at least 80% of the total variance, meaning that not all features are strongly linked together; the links are only “moderately strong”.

9 Summary of what was observed from the results

11 Limitations (this may be the subject of a separate section). An opening is often expected here: “What next?”, “What can be done for improvement?”, etc.